Abstract

This paper examines the issues involved in the design
of conference servers that support multiparty, multimedia
conferences. These servers, called Multipoint Control
Units (MCUs) in the telephony world, coordinate the
distribution of audio, video, and data streams amongst the
multiple participants in a videoconference. The MCU is
responsible for the processing of video and audio so that a
conference participant can hear and see one or more of the
other participants in the conference. It is also responsible
for handling and forwarding the data streams from the
participants. This paper presents different approaches to the
design of an MCU to implement these functions. It also
describes the design of a related device, a transcoding
gateway, that enables conferencing between participants using
different video/audio equipment.



1 Introduction 

In recent years, with the emergence of improved
communication technologies with wider coverage and
accessibility, videoconferencing has become one of the
major new growth applications. Videoconferencing
standards are being developed and more and more
videoconferencing products are appearing in the
market.

Videoconferencing solutions are currently evolving
from several directions. On the one side, there
are the circuit-switched (e.g., Narrowband ISDN or
the Switched-56Kbps phone lines) types of solutions,
which, being motivated by the telephony industry, can
be likened to it. On the other side are the packet-based
network (e.g., Ethernet and Token Ring legacy LANs)
solutions, which are designed to carry real-time
traffic over existing computer communications networks.
The advent of ATM [1] (Asynchronous Transfer Mode)
might eventually allow these two approaches to
converge.

Due to the stringent bandwidth, delay, and jitter
requirements of real-time audio and video data, the
solutions for the circuit-switched and packet-based
networks differ considerably. These differences include
the encoder/decoder (CODEC) technology for video
compression and decompression, the methods used to
guarantee network performance, and the provisions
within the end-stations to handle the real-time traffic.



Videoconferences may be point-to-point or multi-point.

• Point-to-Point. In a point-to-point videoconference
a user is able to connect to only one other
participant and communicate via video, audio, and
shared data applications.

• Multi-Point. A multi-point conference involves
more than two participants and multimedia data
is multicast from each participant to all others.

Each of the above scenarios involves the integrated 
communication of video, audio, graphics, and text 



Approaches used to support multipoint conferences 
may be categorised as either distributed or central- 
ised. In a distributed approach each end-station re- 
ceives the video and audio streams from all, or some, 
of the participating end-station sources in the con- 
ference. Each end-station then composes these mul- 
tiple incoming streams as desired. This approach is 
advantageous since it allows more flexibility and con- 
trol at each end-station and minimises the distance 
that streams need to travel between source and des- 
tination. It requires additional processing capability 
Compressed Composition. The configuration in
Fig. 1(c) performs a composition of the incoming video
streams in the compressed domain. Techniques for
performing this composition of compressed video are
presented in [2]. The resulting video stream
bandwidth may be a multiple of the incoming bandwidth.
For example, if three incoming streams of bandwidth
B are composed into a single stream, where there is
no overlap of the various sources in the composite, the
resulting stream will have an average bandwidth of
3B. Hence, each station's incoming and outgoing link
bandwidths are asymmetric.

Some video encoding algorithms use different
types of frames: reference frames, which are
encoded independently of any other frames,
and predicted frames, which are encoded based on
reference frames. The decoding engine treats these frames
differently, and cannot decode a predicted frame
without the reference frames to which it applies. For these
video streams, in order for the receiving decoder to be
able to handle the composite stream as a single stream,
the reference frames of each source need to be
synchronised so that they are mixed in the same composite
frame, and switching should occur at reference frame
boundaries.

Uncompressed Composition. The final configuration
shown in Fig. 1(d) also performs a composition
of the incoming video streams, but does so in
the pixel domain. This requires that each incoming
stream be decoded and that each outgoing stream be
encoded. Separate encoders for each outgoing stream
are depicted in the figure due to clock synchronisation
requirements between the sender and receiver. If
the synchronisation requirement can be solved
otherwise, a single encoder might suffice. By composing the
incoming streams in the pixel domain, the resulting
stream bandwidth can be adjusted to match different
link speeds.

Another level of complexity may be added to the
composition configurations by allowing each
destination to select its own composition.



3 MCUs and ISDN 

The International Telecommunication Union (ITU,
formerly CCITT) has defined several standards
related to audiovisual services over circuit-switched
public networks (the Narrowband ISDN). These
recommendations relate to the compression of audio and
video, multiplexing (framing) of the multiple data
types onto a single channel, and higher level signalling
and control services. The H.320 document [3] gives an
overview of services and the various associated
standards. Amongst these documents, the H.231
standard [4] defines the functions of an MCU in the
Narrowband ISDN environment. It classifies the
functions as mandatory, to be supported by all compliant
MCUs, and optional. The functions defined address:

• Framing. On the input side, the MCU must de- 
multiplex the incoming H.221 data [5] into audio, 
video, and application data. On the output side, 
after appropriate processing of the individual el- 
ements, it must regenerate an H.221 multiplexed 
stream for each connection. 

• Audio processing. Audio processing is manda- 
tory. However, the selection of functions provided 
can vary in range from simple format conversion 
and simple mixing, to selective mixing and pri- 
vate chats. 

• Video processing. Video processing is considered 
optional, but most MCUs would be expected to 
provide some video processing functions. 

• Data processing. Data processing is optional. 
Moreover, a meaningful exchange of data between 
disparate applications requires a standardisation 
of higher level protocols for collaborative applica- 
tions. 

3.1 Framing 

The MCU is required to support the multiplexing and 
demultiplexing of different data types using the H.221 
standard. In addition to data, there are several con- 
trol signals defined in the H.221 frame. Moreover, the 
values of some control signals are data as well as ter- 
minal dependent. Since the incoming data and ter- 
minal capability may be different from the outgoing 
one, the MCU must decode and strip the incoming
control signals and generate the appropriate outgoing 
control signals. We discuss the significance of some of 
the control signals below. 

H.221 defines the frame structure for audiovisual
teleservices using one or more B (64 Kbps) or H0 (384
Kbps) channels or a single H11 or H12 channel. This
frame structure is used for several purposes such as:
synchronisation of changes in configuration, error
recovery, and synchronisation of the multiple B or H0
connections between the terminals. If a communication
channel uses more than one B connection (the
required data rate is more than 64 Kbps), it is
possible that these multiple B connections are not aligned
with each other. As such, the MCU has to provide the
memory space to buffer the individual B connections
and make use of the frame alignment signal (FAS)
defined in the H.221 frame structure to synchronise and
align them.
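
The buffering and alignment step can be illustrated with a minimal sketch. It assumes, hypothetically, that a lower framing layer has already located the FAS on each B connection and tagged every received frame with a frame counter. The function name `align_channels` and the `(counter, payload)` tuples are illustrative only and are not taken from H.221; counter wrap-around is not handled.

```python
from typing import Dict, List, Tuple

# (frame counter recovered with the help of the FAS, frame payload).
# This representation is a hypothetical simplification.
Frame = Tuple[int, bytes]


def align_channels(channels: List[List[Frame]]) -> List[Tuple[bytes, ...]]:
    """Return one tuple of payloads per frame counter seen on every channel.

    Each inner list is the buffered content of one B connection, in
    arrival order. Frames that arrived before the slowest channel started,
    or after the fastest channel's buffer ends, are left out.
    """
    if not channels or any(not ch for ch in channels):
        return []

    # The latest-starting channel defines the first common frame and the
    # earliest-ending channel defines the last one.
    first = max(ch[0][0] for ch in channels)
    last = min(ch[-1][0] for ch in channels)

    by_counter: List[Dict[int, bytes]] = [dict(ch) for ch in channels]
    aligned = []
    for counter in range(first, last + 1):
        if all(counter in frames for frames in by_counter):
            aligned.append(tuple(frames[counter] for frames in by_counter))
    return aligned
```

In this sketch the amount of buffer memory needed grows with the largest skew between connections, which is the reason the MCU has to set aside memory per B connection.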

In order to provide an end-to-end monitoring of
quality for the connection, H.221 requires the source
to calculate and insert 4-bit CRC codes in every
other H.221 frame. An MCU, therefore, has to examine
and strip the CRC bits of each individual incoming
data stream. After the processing, it is also required
to generate the CRC code and to report a CRC error
(if any of the incoming streams has failed the CRC)
in the outgoing frame.
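
As a rough illustration of the check-and-regenerate step, the sketch below computes a 4-bit CRC by bitwise polynomial division. The generator x^4 + x + 1 is an assumption (it is the generator used by the CRC-4 procedure of ITU-T G.704); the exact block boundaries and bit positions within the H.221 frame are not modelled, and the function names are hypothetical.

```python
from typing import List, Tuple

GENERATOR = 0b10011  # x^4 + x + 1 (assumed generator polynomial)


def crc4(bits: List[int]) -> int:
    """Remainder of bits(x) * x^4 divided by the generator (0..15)."""
    reg = 0
    # Appending four zero bits multiplies the message polynomial by x^4.
    for bit in list(bits) + [0, 0, 0, 0]:
        reg = (reg << 1) | bit
        if reg & 0b10000:
            reg ^= GENERATOR
    return reg


def to_bits(value: int, width: int = 4) -> List[int]:
    """Most-significant-bit-first list of the low `width` bits of value."""
    return [(value >> i) & 1 for i in reversed(range(width))]


def mcu_forward(in_bits: List[int], received_crc: int,
                out_bits: List[int]) -> Tuple[int, bool]:
    """Verify the incoming CRC, then generate the CRC for the outgoing block.

    Returns (outgoing_crc, error_flag); error_flag is True when the
    incoming block failed its check and should be reported downstream.
    """
    error = crc4(in_bits) != received_crc
    return crc4(out_bits), error
```

A block followed by its own CRC bits always divides evenly, so `crc4(block + to_bits(crc4(block)))` is zero; this is the property a receiver relies on.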

The Bit-rate allocation signal (BAS) is used to
transfer commands that describe the capability of a
terminal. Some possible commands include audio
coding formats, video coding formats, and their data rates.
This command is protected with a (16,8) double error
correcting code. Since the BAS command may be
different for the incoming and outgoing frame, an
MCU should be able to interpret, execute, and strip
the incoming commands. It must also have the
ability to convert the incoming commands to appropriate
outgoing commands.






3.2 Audio processing 

The MCU is required to accept audio data in a variety
of formats, with different data rates, such as: G.711
- 64 Kbps (or 56 Kbps) PCM with A-law or μ-law
encoding; G.721 - 32 Kbps ADPCM; and
G.722 - high quality audio at 48, 56, or 64 Kbps.
The MCU must be prepared to decode and mix these
different streams, but it may also provide additional
functions.

• Simple mixing. This mandatory function
combines the audio signals from all the participants
and produces an output audio signal that is
distributed to the participants.

• Selective mixing. The simple mixing function
does not scale well as the number of participants
in the conference increases. The inaudible signals
from a large set of participants may add up to
create a significant disturbance in the output
signal. To prevent this, the MCU can implement a
selective mixing function and limit the number of
input streams that are mixed. The input streams
are monitored and selected for the mixing based
on the signal level (a form of silence detection).
This enhanced mixing function is, for the most
part, transparent to the end-stations. It may be
enhanced by user inputs to permanently select
some input streams, the conference chair being
a possible candidate.

• Private connections. Upon request, the MCU
may provide a direct connection between two
parties in the conference to facilitate a private
conversation that is separate from the main
discussion. Although this function is easily supported,
it requires additional signalling/cueing between
the end stations and the MCU.

Figure 2: Audio processing functions.

It is possible to implement most of these
functions directly on the digital audio streams. Figure 2
shows the different stages of audio processing: input
pre-processing, selection, mixing, and output
post-processing. The input pre-processing consists of
expanding the coded audio samples into linear samples.
It is noted that the different input streams may be
coded using different standards and the processing is
input dependent. This computation may be a simple
table lookup, as in the case of expansion of μ-law or
A-law samples, or a more complex decompression of
ADPCM samples. The selection process consists of
threshold testing the linear samples to detect and
discard silent channels. Given this pre-processing, the
audio mixing process is then a simple addition of the
linear samples. The final post-processing phase
involves re-compression of the mixed output to match
the output required for each connection.
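
The four stages can be sketched as follows. This is a simplified illustration, not a G.711 implementation: it uses the continuous μ-law companding formula with an assumed 8-bit code layout (sign bit plus 7-bit magnitude) rather than the segmented encoding of the standard, and it assumes all inputs carry equal-length blocks of samples. The names `compress`, `expand`, `EXPAND_TABLE`, and `mix` are hypothetical.

```python
import math
from typing import List

MU = 255.0
FULL_SCALE = 32767            # linear samples are 16-bit signed
_LOG = math.log1p(MU)


def compress(sample: int) -> int:
    """Post-processing: linear sample -> 8-bit companded code."""
    x = min(abs(sample), FULL_SCALE) / FULL_SCALE
    magnitude = round(127 * math.log1p(MU * x) / _LOG)
    return magnitude | (0x80 if sample < 0 else 0)


def expand(code: int) -> int:
    """Inverse of compress: 8-bit companded code -> linear sample."""
    magnitude = code & 0x7F
    x = math.expm1(magnitude / 127 * _LOG) / MU
    sample = round(x * FULL_SCALE)
    return -sample if code & 0x80 else sample


# Pre-processing reduces to a table lookup, one entry per possible code.
EXPAND_TABLE = [expand(code) for code in range(256)]


def mix(inputs: List[bytes], threshold: int) -> bytes:
    """Expand, discard silent channels, add, and re-compress one block."""
    linear = [[EXPAND_TABLE[b] for b in block] for block in inputs]

    # Selection: keep channels whose peak level reaches the threshold.
    active = [ch for ch in linear
              if ch and max(abs(s) for s in ch) >= threshold]

    # Mixing: sample-wise addition of the linear samples.
    length = len(inputs[0]) if inputs else 0
    mixed = [0] * length
    for channel in active:
        for i, sample in enumerate(channel):
            mixed[i] += sample

    # Clip to the linear range before re-compression.
    clipped = [max(-FULL_SCALE, min(FULL_SCALE, s)) for s in mixed]
    return bytes(compress(s) for s in clipped)
```

With one talker and one silent participant, `mix` returns the talker's block unchanged, since the silent channel is dropped by the threshold test before the addition.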

This process may be optimised in certain special
cases, such as when the inputs follow the same
standard, to eliminate the input and output
processing and perform mixing directly on the input streams.
This technique has the potential of reducing the
processing power required for the audio processing, but
may result in some degradation in audio quality.

3.3 Video processing

The H.261 standard [6] specifies the structure of the
video stream and it ensures interoperability between
video codecs from different vendors. It defines a
constant bit-rate video stream whose bandwidth
requirement is of the form p x 64 Kbps, where p is an
integer in the range 1..30. Two frame sizes are
specified: the Common Image Format (CIF) which contains
352 x 288 pixels, and QCIF (quarter CIF) which
contains 176 x 144 pixels. The video frame rate can vary
between 7 and 30 frames per second. The standard
specifies a canonical video stream as one which can be
decoded by a reference decoder without causing any
buffer overruns in the decoder.

In order to preserve the quality of the video- 
conference, it is necessary to ensure that the video 
processing does not increase the end-to-end latency. 
Moreover, the video processing and audio processing
must be tuned such that the audio-video synchronisa- 
tion in the original signal is maintained. 

Selection. The MCU selects one of the incoming 
video streams and transmits it to all stations in the 
conference. The selection may be automatic (voice 
activated) or under the control of the conference di- 
rector. 

The selection function can be achieved without 
the need for video decompression and re-compression, 
within certain limits. In the steady state, the video 
processor copies video data from the selected input to 
all outputs. Since the input video stream is a compli- 
ant stream, the output seen by the receiving stations 
is also a compliant stream. However, the process of 
switching from one video stream to another poses some 
problems since the motion vectors in the new output 
stream could be inadvertently applied to the pixels of
the old output stream. A possible technique that can 
be employed to accomplish the switching is described 
below. 

• The MCU requests the selected source to transmit
a refresh frame.

• It monitors the old source to detect the end of the
current frame. After the end of the current frame,
it starts inserting dummy frames. (Alternatively,
it can use the "freeze frame" signal to stall the
output side.)

• When the requested refresh frame arrives from
the new source, the MCU starts sending the new
frame.

Figure 3: Video processing functions.

The video input data must be buffered in the MCU 
so that the delay in the video path can be matched to
the delay in the audio path. 
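
The switching procedure described above can be sketched as a small state machine. The sketch works at whole-frame granularity, so "the end of the current frame" is implicit, and the dummy frame is modelled as an empty placeholder. `Frame`, `VideoSwitch`, `DUMMY`, and the `request_refresh` callback are hypothetical names, not part of H.261 or H.231.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Frame:
    source: str
    is_refresh: bool          # True for a refresh (intra-coded) frame
    payload: bytes = b""


# Placeholder sent while waiting for the new source's refresh frame.
DUMMY = Frame(source="mcu", is_refresh=False)


class VideoSwitch:
    def __init__(self, current: str,
                 request_refresh: Callable[[str], None]) -> None:
        self.current = current
        self.pending: Optional[str] = None
        self._request_refresh = request_refresh

    def select(self, new_source: str) -> None:
        """Step 1: ask the newly selected source for a refresh frame."""
        if new_source != self.current:
            self.pending = new_source
            self._request_refresh(new_source)

    def on_frame(self, frame: Frame) -> Optional[Frame]:
        """Return the frame to send to all outputs, or None to send nothing."""
        if self.pending is None:
            # Steady state: copy the selected input to the outputs.
            return frame if frame.source == self.current else None
        if frame.source == self.pending and frame.is_refresh:
            # Step 3: the refresh frame arrived, complete the switch.
            self.current, self.pending = self.pending, None
            return frame
        if frame.source == self.current:
            # Step 2: old source is replaced by dummy frames.
            return DUMMY
        # Predicted frames from the new source are not yet decodable.
        return None
```

Dropping predicted frames from the new source until its refresh frame arrives is what prevents motion vectors from being applied to the old picture.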

Mixing. Video selection has the limitation that the 
user is able to see only one person in the multi-party 
conference. Video mixing (or composition) may be 
used to provide the user with video images of more 
than one participant. For example, a 4-way video com- 
position can be used to display four participants in a 
single video stream. Moreover, it is not necessary to 
modify the end station in any form since the MCU 
produces a single composite stream. 

In the general form, the mixing process consists 
of the following phases: decompression, composition, 
and compression (see Figure 3). In order to imple- 
ment an n-way composition, it requires n decompres- 
sion units, a pixel composition unit, and one compres- 
sion unit. 

In some special cases, it is possible to con- 
struct a video processor which requires only one 
coder/decoder. Consider the situation in which the 
conference participants use QCIF images for the video. 
The MCU can implement the video composition in the 
following manner (see Figure 4). 

1. Select a subset of the input sources (at most four) 

2. Tile the QCIF inputs to form a CIF image 

3. Decompress the compressed CIF image into pixel 
data 



4. Subsample the CIF pixel data [...]

5. [...] the QCIF [...]

This technique substantially reduces the complexity
of the video composition process. It composes the
input QCIF [...] in compressed form, so the amount
of [...] the output CIF image. An additional benefit
of this approach [...] technique it is possible to
create a low-[...] to delay the audio stream in the MCU.
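
The geometry of the tiling and subsampling steps can be illustrated on uncompressed sample arrays. This is only a sketch of the picture layout: the technique in the text tiles the QCIF inputs while they are still compressed, which is not modelled here. It is also an assumption that the subsampling is 2:1 in each dimension (returning a QCIF-sized picture) and that unused quadrants are left blank. All function names are hypothetical.

```python
from typing import List, Sequence

Image = List[List[int]]       # rows of luminance samples

QCIF_W, QCIF_H = 176, 144     # CIF is 352 x 288, i.e. 2 x 2 QCIF tiles


def blank(width: int = QCIF_W, height: int = QCIF_H) -> Image:
    """An all-zero picture used to fill unused quadrants."""
    return [[0] * width for _ in range(height)]


def tile_four(sources: Sequence[Image]) -> Image:
    """Tile one to four equally sized pictures into a 2 x 2 composite."""
    if not 1 <= len(sources) <= 4:
        raise ValueError("expected between one and four sources")
    height, width = len(sources[0]), len(sources[0][0])
    quads = list(sources) + [blank(width, height)] * (4 - len(sources))
    top = [left + right for left, right in zip(quads[0], quads[1])]
    bottom = [left + right for left, right in zip(quads[2], quads[3])]
    return top + bottom


def subsample(image: Image) -> Image:
    """Keep every second sample in each dimension (no filtering)."""
    return [row[::2] for row in image[::2]]
```

Tiling four 176 x 144 inputs yields a 352 x 288 picture, and subsampling that picture yields 176 x 144 again, with each participant occupying one quarter of the output.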



[...] longer delays in the video coder/decoder [...].
A rate absorption buffer [...] of the compression
engine can be [...] to provide feedback to the
compression engine to vary the compression algorithm
to generate [...]

LAN solutions, which operate over high speed
networks, can [...]. However, it may be more cost
effective [...]



3.4 Data processing

The data stream carried in the ISDN channel is [...]

4 Advanced Functions 
Figure 4: Compressed video mixing.

Figure 5: Gateway Transcoder block diagram.






stream from each input port must be decoded to a
common (raw) format before it can be re-encoded
according to the format required at the destination
output port. A frame buffer is located between the
decode and encode engines where data is stored in its
raw form.
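
The decode, buffer, and re-encode path can be sketched as below. The `Transcoder` class and the two toy "codecs" (a pass-through format and a simple run-length format) are hypothetical stand-ins chosen so that the example runs on its own; a real gateway would plug in actual video and audio codecs.

```python
from typing import Callable, Dict, List

RawFrame = List[int]                      # common (raw) format
Decoder = Callable[[bytes], RawFrame]
Encoder = Callable[[RawFrame], bytes]


class Transcoder:
    def __init__(self, decoders: Dict[str, Decoder],
                 encoders: Dict[str, Encoder]) -> None:
        self.decoders = decoders
        self.encoders = encoders
        self.frame_buffer: RawFrame = []  # sits between decode and encode

    def transcode(self, data: bytes, in_format: str, out_format: str) -> bytes:
        """Decode into the frame buffer, then encode for the output port."""
        self.frame_buffer = self.decoders[in_format](data)
        return self.encoders[out_format](self.frame_buffer)


# Toy formats, for demonstration only.
def decode_plain(data: bytes) -> RawFrame:
    return list(data)


def encode_plain(frame: RawFrame) -> bytes:
    return bytes(frame)


def decode_rle(data: bytes) -> RawFrame:
    """Input is a sequence of (run length, value) byte pairs."""
    frame: RawFrame = []
    for i in range(0, len(data), 2):
        frame.extend([data[i + 1]] * data[i])
    return frame


def encode_rle(frame: RawFrame) -> bytes:
    out = bytearray()
    i = 0
    while i < len(frame):
        run = 1
        while i + run < len(frame) and frame[i + run] == frame[i] and run < 255:
            run += 1
        out += bytes([run, frame[i]])
        i += run
    return bytes(out)
```

Because every format is converted through the same raw representation, supporting a new format in this sketch means adding one decoder and one encoder rather than a converter for every pair of formats.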

In a more general model, each incoming stream
could be sent out as multiple outgoing streams with
different encoding formats. It would also be possible
to receive multiple incoming streams conforming to
[...] integrated outgoing stream.

Besides the encoding translation, there are other
[...] be made. The video and audio data is, typically, in
[...]. Unless a translation between two different
compression algorithms is possible, the stream must
[...] be decompressed [...] and re-compressed by the
new algorithm [...] transmission. Other translations
such as [...] or compression ratio [...] performed.
These translations may be required due to end-station
or network capabilities
Sample Transcoding Scenario



To achieve the benefits of high quality across the LAN,
and low cost across the WAN, the result is a
heterogeneous environment as shown in Fig. 6. As an example,
the LAN solution might use Motion-JPEG video
compression over UDP/IP and the WAN connection could
be H.320 with H.261 video compression over ISDN.
The connectivity between the LAN and the WAN is
through a TCG. The TCG would have to translate
[...] between both environments
during the conference. Furthermore, the TCG would
[...] framed H.261 video stream, and vice-versa. The ISDN
H.320 environment could view the TCG as an H.320
[...]. A more sophisticated TCG could [...]

The concept of multiple MCUs is quite interesting
and can be extended to consider hierarchical MCU
configurations. As food for thought, Fig. 6 illustrates
a scenario with three MCUs: one in each of the LAN
[...], one in the ISDN cloud. [...]
MCU gathers [...] from [...]
stations within its environment and treats the other
MCUs as another end-station. In the configuration
shown in Fig. 6, MCUx may not be aware of [...]
other MCUs, but treats them as end stations. Clearly,
more complex and interesting scenarios are possible.

5 Summary 

In this paper we have described the functions required
to support multimedia, multiparty videoconferences.
These functions may be implemented in a decentralised
manner in each participating end-station, or in a
centralised manner as a multipoint control unit. We
[...] with a particular emphasis on the narrowband
ISDN environment. We have also presented
[...] transcoding gateway [...]
(packet switched) data [...]
Figure 6: Heterogeneous videoconferencing environment.